Abstract

Bayesian max-margin models have shown superiority in various practical applications, such as text categorization, collaborative prediction, social network link prediction and crowdsourcing, and they conjoin the flexibility of Bayesian modeling and the predictive strengths of max-margin learning. However, Monte Carlo sampling for these models still remains challenging, especially for applications that involve large-scale datasets. In this paper, we present the stochastic subgradient Hamiltonian Monte Carlo (HMC) methods, which are easy to implement and computationally efficient. We show the approximate detailed balance property of subgradient HMC, which reveals a natural and validated generalization of the ordinary HMC. Furthermore, we investigate the variants that use stochastic subsampling and thermostats for better scalability and mixing. Using stochastic subgradient Markov Chain Monte Carlo (MCMC), we efficiently solve the posterior inference task of various Bayesian max-margin models, and extensive experimental results demonstrate the effectiveness of our approach.

1 Introduction

Bayesian max-margin (BMM) models have been shown to be very effective in many real-world applications, such as text analysis [30], collaborative prediction [35], social network link prediction [?] and crowdsourcing [34]. Such BMM models conjoin the advantages of discriminative max-margin learning and flexible Bayesian modeling, and they achieve the best of both worlds: obtaining the flexibility from a Bayesian model and meanwhile doing discriminative max-margin learning, through a newly-developed unified Bayesian inference framework, regularized Bayesian inference (RegBayes) [?].

In order to deal with large-scale datasets, developing effective and scalable inference methods is a crucial problem for Bayesian max-margin models, which is becoming a norm in many application areas. Previous variational-approximation-based inference methods are used to solve the BMM models with mean-field assumptions on posterior distributions [?]. When the BMM models use nonparametric Bayesian priors, such variational methods need to adapt the model truncation to finish the variational approximation [37] [43]. Moreover, in such inference schemes, solving support vector machine (SVM) subproblems is time-consuming, which motivated the further developments of the Gibbs classifier formulation and the data augmentation-based Gibbs sampler [?] [?] [42].

In Bayesian inference, if we use a conjugate prior for a given likelihood, we can easily derive the closed-form posterior [12]. However, the BMM models are usually non-conjugate due to the non-smoothness of the hinge loss, which is often involved in an unnormalized pseudo-likelihood. The straightforward Gibbs sampler is not applicable due to the non-conjugacy. With a newly discovered data augmentation technique [26], the augmented Gibbs sampler achieves accurate posterior sampling and is truncation-free for nonparametric BMM models [37] [?]. However, the Gibbs samplers with data augmentation are not efficient either in high-dimensional spaces, as they often involve inverting large matrices [?]. Moreover, the benefit of introducing extra variables would be counteracted in view of the extra computation on dealing with the extra sampling variables [?].




In this paper, we present the subgradient-based Hamiltonian Monte Carlo (HMC) methods for BMM models, which directly draw samples from the original posterior instead of the augmented one. After adopting some mild conditions on the posterior function, we show the approximate detailed balance property for subgradient HMC methods. Then, using stochastic subgradient estimation [?] [?], we further develop the stochastic subgradient MCMC for fast computation. By annealing the discretization stepsizes properly, our stochastic subgradient MCMC methods approximately converge to the target posteriors of the basic Bayesian SVM fairly efficiently. To apply stochastic subgradient MCMC on two different types of BMM models with latent variables, we design two different inference algorithms for latent structure discovery, including a nonparametric Bayesian model. Our stochastic subgradient MCMC can achieve dramatically fast sampling and meanwhile draw accurate posterior samples. We carry out extensive empirical studies on large-scale applications to show the effectiveness and scalability of the presented stochastic subgradient MCMC methods for BMM models.

We note that there have been several previous attempts at using subgradient information in HMC or Langevin Monte Carlo [?] [35], yet our work stands as a first close investigation, in which we give the theoretical guarantee and carry out systematic studies on the stochastic subgradient MCMC for Bayesian max-margin learning.

2 Preliminaries 

We first briefly review the Bayesian max-margin models with Gibbs classifiers. Then, we introduce the background knowledge of the inference methods, including Hamiltonian Monte Carlo (HMC) and its extension, as well as stochastic gradient Hamiltonian Monte Carlo.

2.1 Bayesian Max-margin Models

With the generic framework of RegBayes, we can design more flexible Bayesian models by adding proper regularization on the target posterior. Namely, after adding posterior regularization to a functional-optimization-reformulated Bayesian model, a RegBayes model generally solves the following problem:

$$\inf_{q(\mathcal{M}) \in \mathcal{P}} \; \mathrm{KL}(q(\mathcal{M}) \,\|\, \pi(\mathcal{M})) - \mathbb{E}_q[\log p(\mathcal{D} \mid \mathcal{M})] + c \cdot \mathcal{R}(q(\mathcal{M})), \tag{1}$$

where $\mathcal{M}$ denotes the model (parameters); $\mathcal{P}$ is the feasible space of probability distributions; $\mathrm{KL}(q(\mathcal{M}) \| \pi(\mathcal{M}))$ is the KL divergence from the target posterior $q(\mathcal{M})$ to the prior $\pi(\mathcal{M})$; $\mathcal{D}$ is the observation dataset; $c$ is a nonnegative regularization parameter and $\mathcal{R}(q)$ is a well-designed regularization term on $q$. It is not hard to show that if $c$ equals 0, the solution of problem (1) is the Bayes posterior $q(\mathcal{M}) \propto \pi(\mathcal{M})\, p(\mathcal{D} \mid \mathcal{M})$. If $c$ is not zero, we have an extra dimension of freedom to introduce side information into the inference procedure through the posterior regularization term $\mathcal{R}(q)$. For example, when the regularization $\mathcal{R}$ is defined as a hinge loss in supervised learning tasks, such models turn out to be Bayesian max-margin models [?], and they successfully incorporate the flexibility of Bayesian models and the max-margin classifiers. This strategy has demonstrated promising performance in various tasks, including text classification and topic extraction [?], social network analysis [39], and matrix factorization [36].

In this paper, we consider two examples of Bayesian max-margin models with latent variables, including the max-margin topic model (MedLDA) and infinite SVM (iSVM) [37]. But our methods can be applied to other BMM models. Specifically, MedLDA uses a topic model to find the latent topic representations of the documents and uses a max-margin classifier to do document classification. Infinite SVM generally uses a Bayesian nonparametric Dirichlet process prior to describe data multi-modality and meanwhile uses max-margin classifiers to do discriminative tasks. More details of these two examples will be provided along the development of the proposed fast samplers for them.


2.2 BMM models with a Gibbs classifier


In the supervised learning setting, there are generally two types of classifiers that can be used with a Bayesian model to define a BMM model, namely, expected classifiers and Gibbs classifiers. In this part, we give the introduction of the two formulations and analyze the merits of choosing Gibbs classifiers.

Let $\mathcal{D} = \{(x_d, y_d)\}_{d=1}^{D}$ be a given training set. For each data point $d$, $x_d$ denotes the input features and $y_d$ is the corresponding label, which can be binary or multi-valued. To build a classifier, a Bayesian max-margin model can either use the input features or learn a set of latent features. We use $z_d$ to denote the features that are fed into a classifier. We consider the linear classifier parameterized by $\eta$. Then, if the labels are binary, the prediction rule is defined as

$$\hat{y}_d = \mathrm{sign}[f(\eta; z_d)], \qquad f(\eta; z_d) = \eta^\top z_d, \tag{2}$$

where $\mathrm{sign}(\cdot)$ is the sign function.

For the above setting, an expected classifier learns a posterior distribution $q(\eta)$ in a hypothesis space of classifiers such that the $q$-weighted classifier $\hat{y}_d = \mathrm{sign}\,\mathbb{E}_q[f(\eta; z_d)]$ will have the smallest possible risk, which is typically approximated by the training error $\sum_{d=1}^{D} \mathbb{I}(\hat{y}_d \neq y_d)$, where $\mathbb{I}(\cdot)$ is an indicator function that equals 1 if the predicate holds and 0 otherwise. We define $L(f(\eta; z_d)) = \max(0, \ell - y_d f(\eta; z_d))$ as the hinge loss function with regard to data point $d$, where $\ell$ is the cost of making a wrong prediction. Then, we can use the RegBayes formulation (Eqn. (1)) to define a BMM model with an expected classifier by choosing the loss term $\mathcal{R}(q) = \sum_{d=1}^{D} L(\mathbb{E}_q[f(\eta; z_d)])$, knowing that the hinge loss upper bounds the training error.

Alternatively, the Gibbs classifier draws a classifier according to $q(\eta)$ and uses it to do classification, which is proven to have nice generalization performance [?]. In the Gibbs classifier, the corresponding loss is the expected hinge loss,

$$\mathcal{R}'(q) = \sum_{d=1}^{D} \mathbb{E}_q[L(f(\eta; z_d))]. \tag{3}$$


Since the hinge loss function $L$ is convex, we can show that $\mathcal{R}'$ is an upper bound of $\mathcal{R}$ using Jensen's inequality:

$$\mathbb{E}_q[L(f(\eta; z_d))] \geq L(\mathbb{E}_q[f(\eta; z_d)]). \tag{4}$$


Then, the expected hinge loss is also the upper bound of the expected training error of the Gibbs classifier, $\mathcal{R}'(q) \geq \sum_{d=1}^{D} \mathbb{E}_q[\mathbb{I}(\hat{y}_d \neq y_d)]$. Therefore, the Gibbs classifier formulation gives a more relaxed model while at the same time it can obtain susceptivity, because we draw a single model each time. In addition, with Gibbs classifiers, truncation-free sampling can be performed for BMM models with Bayesian nonparametric priors, which is more accurate than variational approximation. The BMM models with Gibbs classifiers are already shown to have better performance in both classification results and efficiency of the inference algorithms [?] [?].


2.3 Hamiltonian Monte Carlo

One popular MCMC inference method is Hamiltonian Monte Carlo (HMC), also known as Hybrid Monte Carlo [?]. Hamiltonian Monte Carlo is built on molecular dynamics, and the advantage of HMC over random walk Metropolis and Gibbs sampling is proposing a distant move with a high acceptance probability. More recently, the stochastic extensions of HMC were developed for fast sampling.

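Formally, we are interested in the posterior distribution $p(\theta \mid \mathcal{D}) \propto \exp(-U(\theta; \mathcal{D}))$, where $\theta$ denotes the variables of interest and $U$ is the potential energy function in the Hamiltonian dynamics [?]. Consider the case where a posterior distribution jointly takes into account the prior belief and data. The energy function is written as

$$U(\theta; \mathcal{D}) = -\log p_0(\theta) - \log p(\mathcal{D} \mid \theta), \tag{5}$$

where $p_0(\theta)$ is the prior and $p(\mathcal{D} \mid \theta) = \prod_{x \in \mathcal{D}} p(x \mid \theta)$ is the likelihood under the common i.i.d. assumption.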

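After introducing auxiliary momentum variables $r$ and a symmetric positive-definite mass matrix $M$, the HMC sampler simulates the joint distribution $p(\theta, r \mid \mathcal{D}) \propto \exp\left(-U(\theta; \mathcal{D}) - \frac{1}{2} r^\top M^{-1} r\right)$.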

Assuming a differentiable potential energy $U$, we can use an HMC sampler to infer the posterior distribution via simulating the dynamics with some discretization integrator, such as the Euler or leapfrog integrator. Specifically, using the conventional leapfrog integrator with stepsize $h$, the HMC method performs the following steps:

$$\begin{cases} \theta_{t+1} = \theta_t + h M^{-1} r_t, \\ r_{t+1} = r_t - h \nabla_\theta U(\theta_{t+1}; \mathcal{D}), \end{cases} \tag{6}$$

where $r_0$ is initialized as $r_0 \sim \mathcal{N}(0, M)$. Having obtained samples of $(\theta, r)$, we discard the momentum variable $r$ and get samples of $\theta$ from our target posterior.

In particular, if only one leapfrog step is used and $M$ is set to be the identity matrix, we obtain Langevin Monte Carlo (LMC), a special case of HMC [?].

To compensate for the discretization error, a Metropolis-Hastings correction step is employed to retain the invariance of the target distribution.
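As a rough illustration of the procedure reviewed above, the following sketch runs HMC with an identity mass matrix and a Metropolis-Hastings correction. It is not the authors' implementation: the function names `hmc_step` and `run_hmc`, the default stepsize and the number of leapfrog steps are all assumptions made for this example, and the integrator is the textbook leapfrog with half steps for the momentum rather than the compact form of Eqn. (6).

```python
import numpy as np

def hmc_step(theta, U, grad_U, h, n_leapfrog, rng):
    """One HMC transition with identity mass matrix M = I (hypothetical helper)."""
    r = rng.standard_normal(theta.shape)            # r_0 ~ N(0, I)
    theta_new, r_new = theta.copy(), r.copy()
    r_new = r_new - 0.5 * h * grad_U(theta_new)     # initial half step for momentum
    for i in range(n_leapfrog):
        theta_new = theta_new + h * r_new           # full step for position
        if i < n_leapfrog - 1:
            r_new = r_new - h * grad_U(theta_new)   # full step for momentum
    r_new = r_new - 0.5 * h * grad_U(theta_new)     # final half step for momentum
    # Metropolis-Hastings correction on the Hamiltonian H = U + r^T r / 2
    H_old = U(theta) + 0.5 * r @ r
    H_new = U(theta_new) + 0.5 * r_new @ r_new
    if np.log(rng.uniform()) < H_old - H_new:
        return theta_new, True
    return theta, False

def run_hmc(U, grad_U, theta0, n_samples, h=0.2, n_leapfrog=10, seed=0):
    """Collect n_samples states of the chain started at theta0."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    out = np.empty((n_samples, theta.size))
    for t in range(n_samples):
        theta, _ = hmc_step(theta, U, grad_U, h, n_leapfrog, rng)
        out[t] = theta
    return out
```

Setting `n_leapfrog=1` gives a Langevin Monte Carlo proposal, which matches the special case mentioned in the text.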


2.4 Stochastic Gradient HMC

One challenge of the gradient-based HMC methods in dealing with massive data is the expensive evaluation of the posterior gradient $\nabla_\theta U(\theta; \mathcal{D})$. To save time, an unbiased noisy gradient estimate $\nabla_\theta \tilde{U}(\theta; \mathcal{D})$ can be constructed by subsampling the whole dataset, akin to stochastic optimization [?] [27].

This idea was first proposed in [?] to develop the stochastic gradient Langevin dynamics (SGLD), and was later extended by [?] for stochastic gradient HMC with friction and by [?] for stochastic gradient HMC with thermostats. In these stochastic MCMC methods, the gradient of the log-posterior is estimated as

$$\nabla_\theta \tilde{U}(\theta; \mathcal{D}) = -\nabla_\theta \log p_0(\theta) - \frac{|\mathcal{D}|}{|\tilde{\mathcal{D}}|} \sum_{x \in \tilde{\mathcal{D}}} \nabla_\theta \log p(x \mid \theta), \tag{7}$$

where $\tilde{\mathcal{D}}$ is a randomly-drawn subset of $\mathcal{D}$. Since $|\tilde{\mathcal{D}}| \ll |\mathcal{D}|$, computing the noisy gradient estimate turns out to be much cheaper, hence rendering the overall algorithm scalable.
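The minibatch estimate of Eqn. (7) can be sketched in a few lines. The helper below is only illustrative: `stochastic_grad_U`, its argument names, and the choice of sampling without replacement are assumptions for this example rather than details given in the text.

```python
import numpy as np

def stochastic_grad_U(theta, data, grad_log_prior, grad_log_lik, batch_size, rng):
    """Noisy estimate of grad U following Eqn. (7) (hypothetical helper).

    grad_log_prior(theta) and grad_log_lik(x, theta) are user-supplied callables.
    """
    N = len(data)
    idx = rng.choice(N, size=batch_size, replace=False)   # random subset of the data
    lik_term = sum(grad_log_lik(data[i], theta) for i in idx)
    # rescale the minibatch sum by |D| / |D~| so the estimate is unbiased
    return -grad_log_prior(theta) - (N / batch_size) * lik_term
```

With `batch_size` equal to the dataset size the estimate reduces to the exact gradient, which gives a simple sanity check for any model plugged into it.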

We now briefly review the stochastic gradient HMC with thermostats, or stochastic gradient Nosé-Hoover thermostat (SGNHT) [?]. SGNHT uses the simple Euler integrator and introduces a thermostat variable $\xi$ to control the momentum fluctuations as well as the injected noise. The dynamics is simulated as:

$$\begin{cases} r_{t+1} = r_t - h \xi_t r_t - h \nabla_\theta \tilde{U}(\theta_t; \mathcal{D}) + \sqrt{2A}\, \mathcal{N}(0, h), \\ \theta_{t+1} = \theta_t + h r_{t+1}, \\ \xi_{t+1} = \xi_t + h \left( \frac{1}{n} r_{t+1}^\top r_{t+1} - 1 \right), \end{cases} \tag{8}$$

where $A$ is the diffusion factor parameter and $n$ is the dimension of $\theta$; $r_0$ is initialized from the standard normal distribution $\mathcal{N}(0, I)$ and $\xi_0$ is initialized as $A$.
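The three updates of Eqn. (8) translate almost line by line into code. The sketch below is an illustration under assumptions: the function name `sgnht`, the callable signature `stoch_grad_U(theta, rng)` and the default values of `h` and `A` are chosen for this example only.

```python
import numpy as np

def sgnht(stoch_grad_U, theta0, n_iter, h=0.01, A=1.0, seed=0):
    """Simulate the SGNHT dynamics of Eqn. (8) with a fixed stepsize h (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    n = theta.size
    r = rng.standard_normal(n)      # r_0 ~ N(0, I)
    xi = A                          # xi_0 = A
    samples = np.empty((n_iter, n))
    for t in range(n_iter):
        # sqrt(2A) * N(0, h): Gaussian noise with variance 2 A h per dimension
        noise = np.sqrt(2.0 * A) * np.sqrt(h) * rng.standard_normal(n)
        r = r - h * xi * r - h * stoch_grad_U(theta, rng) + noise
        theta = theta + h * r
        xi = xi + h * (r @ r / n - 1.0)   # thermostat update
        samples[t] = theta
    return samples
```

Passing a subgradient estimate instead of a gradient estimate for `stoch_grad_U` requires no change to the loop itself.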

Such stochastic gradient MCMC methods are shown to have a weak posterior-mean convergence instead of a strong sample-wise convergence [?]. Such weak convergence is sufficient in many real-world applications.

Footnote: In the supervised learning setting, the likelihood is $p(\mathcal{D} \mid \theta) = \prod_{d} p(y_d \mid x_d, \theta)$.
3 Stochastic Subgradient MCMC


One common part in all the above HMC methods is the (stochastic) gradient of the log-posterior. However, such a gradient might not always be available. In this section, we investigate a more general subgradient HMC method, analyze its theoretical properties, and use it for the fast inference of Bayesian linear SVMs.


3.1 Subgradient HMC and Its Approximate Detailed Balance

When the log-posterior is non-differentiable, gradient-based HMC is not applicable. Using the more general subgradients could potentially address this problem, in analogy to the subgradient descent methods in deterministic optimization [?].

By plugging the posterior subgradient $\mathcal{G}_\theta U(\theta; \mathcal{D})$ into the ordinary HMC, we come up with the subgradient HMC with a leapfrog method as:

$$\begin{cases} \theta_{t+1} = \theta_t + h M^{-1} r_t, \\ r_{t+1} = r_t - h\, \mathcal{G}_\theta U(\theta_{t+1}; \mathcal{D}), \end{cases} \tag{9}$$

where $r_0$ is initialized as $\mathcal{N}(0, M)$ and $h$ is the discretization stepsize.

From a theoretical perspective, we may not be able to readily analyze the volume preservation property of the Hamiltonian dynamics with a non-differentiable potential energy, nor the detailed balance of a general subgradient HMC sampler. Instead, we give an approximate theoretical analysis based on several practical assumptions on the potential energy.

In practical Bayesian models, the non-smoothness of the posterior often lies in the hinge loss induced likelihoods, which are mainly considered in this paper. These posteriors are continuous everywhere and piece-wise smooth with only a finite number of non-smooth points. The sampler will hit those non-differentiable states with probability zero. Under such practical assumptions, we show the following approximate detailed balance property, which claims that the subgradient HMC satisfies the detailed balance property with a polynomial smooth of the potential energy.

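We first give a polynomial smooth of the potential energy $U_0$. The continuous and piece-wise differentiable posterior is non-smooth on a finite set $S = \{s_1, \dots, s_m\}$, and then the $\epsilon$-neighborhoods around all $s_i$ are defined as $B(s_i, \epsilon) = \{\theta : \|\theta - s_i\| < \epsilon\}$, $i = 1, 2, \dots, m$. By setting $\epsilon$ small enough, the $\epsilon$-neighborhoods can be mutually disjoint, i.e., $B(s_i, \epsilon) \cap B(s_j, \epsilon) = \emptyset$, $\forall s_i, s_j \in S$, $i \neq j$. Using such mutually disjoint neighborhoods, $U_\epsilon$ will be constructed as

$$U_\epsilon(\theta) = \begin{cases} U_0(\theta), & \theta \notin \bigcup_{i} B(s_i, \epsilon), \\ P_i(\theta), & \theta \in B(s_i, \epsilon), \end{cases} \tag{10}$$

where $P_i$ is a multi-dimensional Hermite's interpolating polynomial [?] satisfying

$$\begin{cases} P_i(s_i \pm \epsilon) = U_0(s_i \pm \epsilon), \\ \nabla P_i(s_i \pm \epsilon) = \nabla U_0(s_i \pm \epsilon). \end{cases} \tag{11}$$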


According to the definition of $U_\epsilon$, we can see that $U_\epsilon$ is smooth everywhere. Moreover, when $\theta \notin \bigcup_{i} B(s_i, \epsilon)$, $s_i \in S$, we have

$$\nabla_\theta U_\epsilon(\theta) = \nabla_\theta U_0(\theta) = \mathcal{G}_\theta U_0(\theta). \tag{12}$$

When $\epsilon$ is small enough, the posterior subgradient $\mathcal{G}_\theta U_0$ used in Eqn. (9) is approximately the same as $\nabla_\theta U_\epsilon$, and it will be scarcely possible for the sampler to hit those neighborhoods, since the measure of each of the finitely many neighborhoods is of order $\epsilon^n$. Then subgradient HMC can be regarded as drawing samples from a smooth posterior instead. With this approximation, the subgradient HMC satisfies detailed balance and is thus valid for generating approximate samples from the true posterior.

We give an intuitive illustration of our theoretical analysis in Fig. 1, where we construct several polynomial smooth functions $U_\epsilon$ for a continuous but non-smooth function $U_0$. As can be seen, when $\epsilon$ is as small as 0.15, $U_\epsilon$ is very close to $U_0$, and it is very unlikely for a sampler using a finite number of samples (such as 100 samples) to hit the two neighborhoods $B(-1, 0.15)$ and $B(1, 0.15)$.
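To make the smoothing construction concrete, the sketch below builds a one-dimensional $U_\epsilon$ for the hinge function $\max(0, 1 - \theta)$, which has a single kink at $s = 1$. This is an illustrative assumption, not the function plotted in the paper: the names `hinge`, `hinge_grad` and `smoothed` are hypothetical, and the interpolant is the standard cubic Hermite polynomial matching values and derivatives at $s \pm \epsilon$.

```python
import numpy as np

def hinge(theta):
    """U_0(theta) = max(0, 1 - theta), non-smooth at theta = 1."""
    return np.maximum(0.0, 1.0 - theta)

def hinge_grad(theta):
    """Derivative of the hinge away from the kink."""
    return np.where(theta < 1.0, -1.0, 0.0)

def smoothed(theta, U0, dU0, s, eps):
    """U_eps: equal to U0 outside (s - eps, s + eps), cubic Hermite interpolant inside."""
    theta = np.asarray(theta, dtype=float)
    a, b = s - eps, s + eps
    t = np.clip((theta - a) / (b - a), 0.0, 1.0)   # position inside the neighborhood
    h00 = 2 * t**3 - 3 * t**2 + 1                   # Hermite basis polynomials
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    P = (h00 * U0(a) + h10 * (b - a) * dU0(a)
         + h01 * U0(b) + h11 * (b - a) * dU0(b))
    inside = (theta > a) & (theta < b)
    return np.where(inside, P, U0(theta))
```

For this particular $U_0$ the gap between $U_\epsilon$ and $U_0$ is largest at the kink, where it equals $\epsilon / 4$, so shrinking $\epsilon$ shrinks both the neighborhood and the approximation error.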


3.2 Stochastic Subgradient MCMC in Practice

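We can obtain the version of stochastic subgradient Langevin dynamics (SSGLD) by replacing the gradient of the log-posterior with its subgradient. More formally, SSGLD generates samples by simulating the following dynamics:

$$\theta_{t+1} = \theta_t - \frac{h}{2}\, \tilde{\mathcal{G}}_\theta U(\theta_t; \mathcal{D}) + \mathcal{N}(0, h), \tag{13}$$

where $\tilde{\mathcal{G}}_\theta U(\theta; \mathcal{D}) = -\mathcal{G}_\theta \log p_0(\theta) - \frac{|\mathcal{D}|}{|\tilde{\mathcal{D}}|} \sum_{x \in \tilde{\mathcal{D}}} \mathcal{G}_\theta \log p(x \mid \theta)$ is the stochastic noisy estimate of the subgradient.

Figure 1: Illustration of Polynomial Smooth Construction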

In existing SGLD methods [?], it is recommended to use a polynomially decaying stepsize to save the MH correction step of the Langevin proposals. When the stepsize properly decays, the Markov chain gradually converges to the target posterior. One subtle part of the method is thus tuning the discretization stepsize. A pre-specified annealing scheme (if not chosen properly) would make the chain either miss or oscillate around the target. More recent work [?] recommends some relatively optimal scheme for SGLD. Inspired by adaptive stepsizes for (sub)gradient descent (AdaGrad) methods [?], we, in this paper, adopt the same adaptive stepsize setting for our SSGLD methods [?]. As we shall see in the experiments, such a scheme is beneficial for yielding faster mixing speeds.

We can derive stochastic subgradient Hamiltonian Monte Carlo likewise. We adopt an improved version of stochastic gradient HMC to derive our stochastic subgradient Nosé-Hoover thermostat (SSGNHT), which generates samples via the following iterations:

$$\begin{cases} r_{t+1} = r_t - h \xi_t r_t - h\, \tilde{\mathcal{G}}_\theta U(\theta_t; \mathcal{D}) + \sqrt{2A}\, \mathcal{N}(0, h), \\ \theta_{t+1} = \theta_t + h r_{t+1}, \\ \xi_{t+1} = \xi_t + h \left( \frac{1}{n} r_{t+1}^\top r_{t+1} - 1 \right). \end{cases} \tag{14}$$

Again we omit the MH correction step, and the SSGNHT simulations would generate posterior samples more efficiently with properly decaying stepsizes and thermostat initialization.


3.3 Stochastic Subgradient MCMC for Bayesian Linear SVMs

The stochastic subgradient MCMC can be used for fast sampling of the Bayesian linear SVM. Let $\mathcal{D} = \{(x_d, y_d)\}_{d=1}^{D}$ be the given training dataset, where $x_d$ is the $n$-dimensional feature vector of the $d$-th instance and $y_d \in \{+1, -1\}$ is the binary label. We use linear classifiers with a weight vector $\eta \in \mathbb{R}^n$, and the decision rule is naturally $\hat{y} = \mathrm{sign}(\eta^\top x)$. Then for a Bayesian linear SVM model, we are interested in learning the posterior distribution $p(\eta \mid \mathcal{D}) \propto p_0(\eta) \prod_{d=1}^{D} \psi(y_d \mid x_d, \eta)$. The prior is commonly set as a standard normal distribution $p_0(\eta) = \mathcal{N}(0, I)$, and the per-datum unnormalized likelihood is $\psi(y_d \mid x_d, \eta) = \exp(-c \cdot \max(0, \ell - y_d \eta^\top x_d))$. Thus, the subgradient of the log-posterior involves evaluating the subgradient of the non-differentiable log-likelihood:

$$\mathcal{G}_\eta \log \psi(y_d \mid x_d, \eta) = \begin{cases} c\, y_d x_d, & \ell - y_d \eta^\top x_d > 0, \\ 0, & \ell - y_d \eta^\top x_d \leq 0. \end{cases} \tag{15}$$

With this subgradient, we can use the stochastic subgradient MCMC method to do fast sampling for the Bayesian linear SVM model.
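A minimal sketch of stochastic subgradient Langevin dynamics for this model is given below, assuming a standard normal prior, the hinge subgradient of Eqn. (15), a Langevin update with drift $-\frac{h}{2}\tilde{\mathcal{G}}U$ and noise variance $h$, and a polynomially decaying stepsize. The function names and all default values (`a`, `b`, `gamma`, `batch`) are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def hinge_loglik_subgrad(eta, X, y, c=1.0, ell=1.0):
    """Sum over the rows of X of a subgradient of log psi = -c * max(0, ell - y * eta^T x)."""
    active = (ell - y * (X @ eta)) > 0            # points that violate the margin
    return c * (X[active] * y[active, None]).sum(axis=0)

def ssgld_svm(X, y, n_iter=2000, batch=50, a=0.01, b=1.0, gamma=0.51, c=1.0, seed=0):
    """SSGLD sketch for a Bayesian linear SVM with prior N(0, I) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    eta = np.zeros(n)
    samples = np.empty((n_iter, n))
    for t in range(n_iter):
        h = a * (1.0 + t / b) ** (-gamma)          # decaying stepsize
        idx = rng.choice(N, size=batch, replace=False)
        # stochastic subgradient of U = -log p0(eta) - sum_d log psi(y_d | x_d, eta)
        g_U = eta - (N / batch) * hinge_loglik_subgrad(eta, X[idx], y[idx], c)
        eta = eta - 0.5 * h * g_U + np.sqrt(h) * rng.standard_normal(n)
        samples[t] = eta
    return samples
```

A typical use is to discard an initial portion of `samples` as burn-in and average the rest to obtain a posterior-mean classifier.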

4 Fast Sampling for Bayesian Max-margin Models with Latent Variables

We now show how to leverage the above stochastic subgradient MCMC methods to derive fast sampling algorithms for Bayesian max-margin models with latent variables. We develop algorithms for two different BMM models with latent variables.

4.1 Fast Sampling for Max-margin Topic Models

For parametric BMM models, whose number of model parameters is fixed, we just calculate the (stochastic) log-posterior subgradient and run our stochastic subgradient MCMC method. In this part, we use Gibbs MedLDA [?] as an example to show how to do fast sampling for parametric BMM models.



Figure 2: Graphical model representation of Gibbs MedLDA


4.1.1 Gibbs MedLDA

As illustrated in Fig. 2, the max-margin topic model has two parts: 1) a latent Dirichlet allocation model for modeling underlying topic structures of the given documents and 2) a max-margin classifier for predicting document labels. The LDA part is a hierarchical Bayesian model which uses an admixture of $K$ topics, $\Phi = \{\phi_k\}_{k=1}^{K}$, as the latent document representation. Here each topic $\phi_k$ is a multinomial distribution over a $V$-word vocabulary and has the symmetric Dirichlet prior $\mathrm{Dir}(\beta)$. For a single document $d$, words are generated and the detailed process is:

1. draw a topic proportion $\theta_d \sim \mathrm{Dir}(\alpha)$
2. for each word $n$ ($1 \leq n \leq N_d$):
   (a) draw a topic assignment $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$,
   (b) draw the observed word $w_{dn} \sim \mathrm{Multinomial}(\phi_{z_{dn}})$.
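The generative process listed above can be simulated directly, which is useful for producing small synthetic corpora when testing a sampler. The sketch below is illustrative only: `generate_corpus` and its arguments (a fixed document length, the default values of `alpha` and `beta`) are assumptions for this example.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, K, V, alpha=0.5, beta=0.1, seed=0):
    """Sample topics, topic assignments and words from the LDA generative process."""
    rng = np.random.default_rng(seed)
    Phi = rng.dirichlet(np.full(V, beta), size=K)         # K topics over V words
    docs, assignments = [], []
    for _ in range(n_docs):
        theta_d = rng.dirichlet(np.full(K, alpha))        # step 1: topic proportion
        z = rng.choice(K, size=doc_len, p=theta_d)        # step 2(a): topic assignments
        w = np.array([rng.choice(V, p=Phi[k]) for k in z])  # step 2(b): observed words
        docs.append(w)
        assignments.append(z)
    return Phi, docs, assignments
```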
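Given a set of documents $W = \{w_d\}_{d=1}^{D}$, we denote its latent topic proportions as $\Theta = \{\theta_d\}_{d=1}^{D}$ and the topic assignments as $Z = \{z_d\}_{d=1}^{D}$. Let $\bar{z}_d$ be the average topic assignments of the words in document $d$, with element $\bar{z}_d^k = \frac{1}{N_d} \sum_{n=1}^{N_d} \mathbb{I}(z_{dn} = k)$.

We use the Gibbs classifier formulation to build the Gibbs MedLDA model. If we have drawn a sample of the topic assignments $Z$ and the classifier weights $\eta$ from the posterior distribution, we can get the prediction of the document label $y \in \{1, 2, \dots, L\}$ as:

$$\hat{y}_d = \arg\max_{y} \eta^\top f(y, \bar{z}_d), \tag{16}$$

where $f(y, \bar{z}_d)$ is a vector consisting of $L$ subvectors with the $y$-th being $\bar{z}_d$ and all others being zero. The corresponding expected hinge loss is

$$\mathcal{R}'(q) = \sum_{d=1}^{D} \mathbb{E}_q\left[\max\left(0, \ell + \max_{y \neq y_d} \eta^\top f(y, \bar{z}_d) - \eta^\top f(y_d, \bar{z}_d)\right)\right], \tag{17}$$

where $\ell$ is the cost of making a wrong prediction.

Then, Gibbs MedLDA infers the latent topic assignments $Z$ and the classifier weights $\eta$ by solving the following problem:

$$\min_{q(\eta, \Theta, Z, \Phi)} \; \mathcal{L}(q(\eta, \Theta, Z, \Phi)) + c\, \mathcal{R}'(q(\eta, \Theta, Z, \Phi)), \tag{18}$$

where $\mathcal{L} = \mathrm{KL}(q \,\|\, p_0(\eta, \Theta, Z, \Phi)) - \mathbb{E}_q[\log p(W \mid Z, \Phi)]$ is the reformulated objective when doing standard Bayesian inference.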







4.1.2 Fast Sampling for Gibbs MedLDA

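Instead of sampling in the whole space, which may lead to low efficiency [?], we collapse out $\Theta$ and draw samples from the collapsed distribution:

$$p(W, Z, \Phi, \eta \mid \alpha, \beta) = p_0(\eta)\, p_0(\Phi \mid \beta) \prod_{d=1}^{D} p(w_d, z_d \mid \alpha, \Phi)\, \psi(y_d \mid z_d, \eta), \tag{19}$$

where

$$p(w_d, z_d \mid \alpha, \Phi) = \prod_{k=1}^{K} \frac{\Gamma(\alpha + C_{dk})}{\Gamma(\alpha)} \prod_{w=1}^{V} \phi_{kw}^{C_{dkw}},$$

$C_{dk}$ is the number of words in document $d$ that are assigned to topic $k$, and $C_{dkw}$ is the number of occurrences of word $w$ in document $d$ that are assigned to topic $k$. $\psi(y_d \mid z_d, \eta)$ is defined as:

$$\psi(y_d \mid z_d, \eta) = \exp\left(-c \max\left(0, \ell + \max_{y \neq y_d} \eta^\top f(y, \bar{z}_d) - \eta^\top f(y_d, \bar{z}_d)\right)\right). \tag{20}$$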

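For the collapsed posterior of MedLDA, we can sample classifiers $\eta$ using stochastic subgradient MCMC and sample the topic model parameters $\Phi$ using the SGRLD method [?]. With the randomly-drawn document minibatch $\tilde{\mathcal{D}}$, we get the stochastic subgradient of the log posterior with respect to $\eta$ from the per-document terms:

$$\begin{cases} \mathcal{G}_{\eta_y} \log \psi(y_d \mid z_d, \eta) = 0, \; \forall y, & \text{if } \psi(y_d \mid z_d, \eta) = 1, \\ \mathcal{G}_{\eta_{y_d}} \log \psi(y_d \mid z_d, \eta) = -\mathcal{G}_{\eta_{\tilde{y}_d}} \log \psi(y_d \mid z_d, \eta) = c\, \bar{z}_d, & \text{if } \psi(y_d \mid z_d, \eta) < 1, \end{cases} \tag{21}$$

where $\tilde{y}_d = \arg\max_{y \neq y_d} \eta^\top f(y, \bar{z}_d)$. Here, $\eta_y$ is the $y$-th subvector of $\eta$, which corresponds to the non-zero elements of $f(y, \bar{z}_d)$, and in the second case of the calculation, the subgradients with respect to the unmentioned subvectors of $\eta$ are zero. With the stochastic posterior subgradient with respect to $\eta$, we can use stochastic subgradient MCMC to sample $\eta$.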

We use the expanded-mean formulation for $\phi_{kw} = |\hat{\phi}_{kw}| / \sum_{w} |\hat{\phi}_{kw}|$ and follow the SGRLD updates to sample the admixture $\Phi$ on the Riemannian manifold (Eqn. 10 in [?]).

The stochastic posterior (sub)gradients with respect to $\Phi$ and $\eta$ are calculated given the expectation of $\bar{z}_d$ [?]. To calculate the expectation of $\bar{z}_d$, the Gibbs sampling iterations for the topic assignments of document $d$ are as follows:

$$p(z_{dn} = k \mid Z_{\neg dn}, \Phi, \eta) \propto \left(\alpha + C_{dk}^{\neg n}\right) \phi_{k w_{dn}}\, \psi(y_d \mid \bar{z}_d^{\,n \to k}, \eta), \tag{22}$$

where $Z_{\neg dn}$ is the topic assignments of the other words and documents, $\bar{z}_d^{\,n \to k}$ is the average topic assignments after setting $z_{dn} = k$, and $C_{dk}^{\neg n}$ is the number of words assigned to topic $k$ in document $d$ after removing word $n$. With the learned topic admixture $\Phi$ and classifier weights $\eta$, we randomly draw a sample of $\Phi$ and $\eta$ and make predictions as described in [42]. The overall stochastic sampler for Gibbs MedLDA is summarized in Algorithm 1.


Algorithm 1 SSGRLD for Gibbs MedLDA
Input: documents $\{(w_d, y_d)\}$, $d \in [1, D]$
Initialization
repeat
    Draw a stochastic subset $\tilde{\mathcal{D}}$
    Draw topic assignments of the documents in $\tilde{\mathcal{D}}$ using Eqn. (22)
    Compute the stochastic posterior (sub)gradient with respect to $\Phi$ and $\eta$
    Run the subgradient sampler for $\Phi$ and $\eta$ with the stochastic posterior subgradient
until Converge


4.2 Fast Sampling for Infinite SVMs


Another important type of Bayesian max-margin models with latent variables uses Bayesian nonparametric priors. Such BMM models are defined on infinite-dimensional spaces, and the size of the models will be learned from the data. A typical example of this type is infinite SVM [37], and we use the HMC-within-Gibbs strategy to build fast sampling methods for this type of models.


4.2.1 Gibbs infinite SVM

Real world data often have some latent clustering structures, where mixture-of-experts models are generally capable of capturing these local structures. When each expert is a linear SVM, the resultant mixture of SVMs learns a non-linear model instead of strictly a linear one [?]. Recent work further presents a nonparametric extension, infinite SVM (iSVM) [37] (see Fig. 3), which automatically infers the number of experts. Below, we apply the subgradient-based fast sampling method to infinite SVM.

Figure 3: Graphical model representation of Gibbs iSVM

Given a set of data $\mathcal{D} = \{(x_d, y_d)\}_{d=1}^{D}$, we let $z_d$ denote the component assignment for the data point $d$. Each component $k$ is associated with a linear classifier $\eta_k$ and a Gaussian likelihood $\mathcal{N}(\mu_k, \Sigma_k)$ to describe the input features $x_d$. All the parameters follow some priors: a standard Gaussian prior for $\eta_k$ and a Gaussian-Inverse-Wishart conjugate prior for $(\mu_k, \Sigma_k)$. In iSVM, we choose a Chinese Restaurant Process (CRP) [22] for $Z$.

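Though alternative approaches exist, we define the expert classifier as a Gibbs classifier [?] for the assignments $Z$ and the classifier weights $\eta$. Namely, given the posterior distribution $q(Z, \eta)$, the Gibbs classifier draws a component assignment $z_d$ and a classifier $\eta_{z_d}$ for each data point $x_d$, and makes predictions

$$\hat{y}_d = \arg\max_{y} f(y, z_d, x_d), \qquad f(y, z_d, x_d) = \eta_{z_d}^\top \mathbf{f}(y, x_d), \tag{23}$$

where $\mathbf{f}(y, x_d)$ is a long vector consisting of $L$ subvectors with the $y$-th being $x_d$ and all others being zero.

We adopt the expected hinge loss for Gibbs iSVM:

$$\mathcal{R}'(q(Z, \eta)) = \sum_{d=1}^{D} \mathbb{E}_q\left[\max\left(0, \ell + \max_{y} f(y, z_d, x_d) - f(y_d, z_d, x_d)\right)\right]. \tag{24}$$

Together with the Gibbs classifier and the expected hinge loss, we can define a RegBayes model for the mixture of Gibbs classifiers:

$$\min_{q(Z, \eta, \gamma)} \; \mathcal{L}(q(Z, \eta, \gamma)) + c\, \mathcal{R}'(q(Z, \eta)), \tag{25}$$

where $\gamma = (\mu, \Sigma)$ are the mean and variance parameters for each Gaussian component and $\mathcal{L} = \mathrm{KL}(q(Z, \eta, \gamma) \,\|\, p_0(Z, \eta, \gamma)) - \mathbb{E}_q[\log p(X \mid Z, \gamma)]$ is the objective function when doing standard Bayesian inference. With regard to the RegBayes formulation in Eqn. (1), the normalized posterior distribution of infinite SVM is

$$q(Z, \eta, \gamma) \propto p_0(Z, \eta, \gamma)\, p(X \mid Z, \gamma) \prod_{d=1}^{D} \psi(y_d \mid z_d, \eta), \tag{26}$$

where $\psi(y_d \mid z_d, \eta) = \exp\left(-c \max\left(0, \ell + \max_{y} f(y, z_d, x_d) - f(y_d, z_d, x_d)\right)\right)$. We refer readers to [?] for more details.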


4.2.2 Fast sampling for Gibbs iSVM

We develop the fast sampling method for Gibbs iSVM by incorporating the stochastic subgradient MCMC method within the loop of a Gibbs sampler. The HMC-within-Gibbs strategy for iSVM is detailed below.

For $Z$: Given $\eta$, the conditional distribution is

$$p(Z \mid \eta) \propto p_0(Z)\, p(X \mid Z)\, \psi(\mathbf{y} \mid Z, \eta), \tag{27}$$

where $p(X \mid Z) = \int p(X \mid Z, \gamma)\, p_0(\gamma)\, d\gamma$ is the marginal distribution via collapsing $\gamma$ and $p_0(Z)$ is the CRP prior. Let $\alpha$ be the hyper-parameter of the CRP prior and $n_{k, \neg d}$ be the number of points that belong to component $k$ except $d$. Given classifiers $\eta$ and assignments of other data points $Z_{\neg d}$, we sample component assignments $z_d$ by normalizing the following two probabilities (existing component and a new component):

1) $p(z_d = k \mid Z_{\neg d}, \eta) \propto n_{k, \neg d}\, \psi(y_d \mid z_d = k, \eta_k)\, p(x_d \mid z_d = k, Z_{\neg d}, X_{\neg d})$
2) $p(z_d = k_{\text{new}} \mid Z_{\neg d}, \eta) \propto \alpha\, p(x_d) \int \psi(y_d \mid \eta)\, p_0(\eta)\, d\eta$

In case 2), $p(x_d) = \int p(x_d \mid \gamma)\, p_0(\gamma)\, d\gamma$ is the likelihood of the data point $d$ and can be computed in closed form using the conjugate property. The second integral in 2) can be approximated by using importance sampling.



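For $\eta$: Given $Z$, the number of active clusters is known. We need to efficiently sample the classifier weights $\eta_k$ of each component $k$ from the following conditional distribution:

$$p(\eta_k \mid Z) \propto p_0(\eta_k) \prod_{d:\, z_d = k} \psi(y_d \mid z_d, \eta_k), \tag{28}$$

where $p_0(\eta_k)$ is a standard normal prior. With our proposed stochastic subgradient MCMC, the classifiers $\eta$ can be directly sampled using only a minibatch of the whole dataset. Here, we give the stochastic subgradients of the log conditional distribution:

$$\tilde{\mathcal{G}}_{\eta_k} \log p(\eta_k \mid Z) = \mathcal{G}_{\eta_k} \log p_0(\eta_k) + \frac{|\mathcal{D}_k|}{|\tilde{\mathcal{D}}_k|} \sum_{d \in \tilde{\mathcal{D}}_k} \mathcal{G}_{\eta_k} \log \psi(y_d \mid z_d, \eta_k), \tag{29}$$

where $\mathcal{D}_k$ is the set of points assigned to component $k$, $\tilde{\mathcal{D}}_k$ is a minibatch of it, and the subgradient of the multi-class hinge loss is similarly defined as in Eqn. (21). Using this subgradient in the SSGLD (Eqn. (13)) or SSGNHT (Eqn. (14)), we can derive the stochastic subgradient inner sampler for classifiers $\eta$.

The whole stochastic HMC (LMC)-within-Gibbs algorithm structure is outlined in Algorithm 2.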


Algorithm 2 Stochastic HMC within Gibbs for infinite SVM
Input: $\{(x_d, y_d)\}$, $d \in [1, D]$, batchsize $S$.
Initialization
repeat
    sample $Z$ given $\eta$
    sample $\eta$ given $Z$ using stochastic subgradient HMC
until Converge


5 Experiments

We now implement our stochastic subgradient MCMC on various Bayesian max-margin models, including the basic Bayesian linear SVM and two sophisticated Bayesian max-margin models with latent variables (iSVM and Gibbs MedLDA). Our results demonstrate that stochastic subgradient MCMC can achieve great improvement in time efficiency while still generating accurate posterior samples.

All experiments are done on a desktop computer with a single-core rate up to 3.0GHz. The stepsize parameter at iteration $t$ decays via $h_t = a \cdot (1 + t/b)^{-\gamma}$. Normally, we set $b = 1$ for the SVM classifier $\eta$ and $b = 100$ for the topic-word parameter $\Phi$. We choose $a$ and $\gamma$ via a grid search. Furthermore, the AdaGrad stepsizes are considered for the stochastic subgradient Langevin dynamics method.
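The two stepsize rules mentioned above can be written compactly. The polynomial schedule follows the decay formula in the text; the `AdaGradStepsize` class is only an assumed, generic AdaGrad-style rule (per-dimension stepsize proportional to the inverse root of accumulated squared subgradients), since the text does not spell out how AdaGrad is combined with the Langevin update. The names and the `base` and `eps` defaults are hypothetical.

```python
import numpy as np

def poly_stepsize(t, a, b, gamma):
    """Polynomially decaying stepsize h_t = a * (1 + t / b) ** (-gamma)."""
    return a * (1.0 + t / b) ** (-gamma)

class AdaGradStepsize:
    """Per-dimension stepsize from accumulated squared (sub)gradients (assumed form)."""

    def __init__(self, n_dim, base=0.1, eps=1e-8):
        self.base, self.eps = base, eps
        self.sum_sq = np.zeros(n_dim)

    def __call__(self, grad):
        self.sum_sq += grad ** 2                      # accumulate squared subgradients
        return self.base / (np.sqrt(self.sum_sq) + self.eps)
```

Dimensions with consistently large subgradients receive smaller stepsizes, which is one way to read the remark about stepsizes decaying differently across dimensions.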

5.1 Bayesian Linear SVMs

We first consider the basic Bayesian linear SVM model and compare our stochastic subgradient sampling methods with the Gibbs sampler with data augmentation [42] and the random walk Metropolis with stochastic MH test [?] (stochastic random walk Metropolis, SRWM).

5.1.1 Results on Synthetic Data

We first test our methods on a 2D synthetic dataset to show that our methods give correct samples from the posterior distribution. Note that we view the results of this experiment as a simple proof of idea and hence choose the more direct visual comparison. We follow the Bayesian linear SVM model defined in Section 3.3 and use 1,000 observations as the synthetic dataset. Specifically, we generate features $x$ from a uniform distribution on an interval with upper endpoint 1 and the coefficient vector $\eta$ from a zero-mean normal distribution. Given the features and coefficients, the labels are generated from the Bernoulli distribution with parameter $\delta_d$, where

$$\delta_d = \frac{\psi(y = 1 \mid x_d, \eta)}{\psi(y = -1 \mid x_d, \eta) + \psi(y = 1 \mid x_d, \eta)}.$$

Figure 4: Visual comparison of posterior samples


We compare the samples obtained from SSGLD and SSGNHT with those from the data augmentation method, which is an accurate sampler for Bayesian SVMs. We take 5,000 samples for each method after a sufficiently long burn-in stage and give the comparison in Fig. 4, where the densities of the obtained samples are shown via the grayscales of the grids. The results suggest that our stochastic subgradient MCMC methods are accurate, although the stochastic subsampling and the neglect of the MH test bring some noise. This result is compatible with the previous weak convergence analysis of the ordinary HMC methods [?] [?].


5.1.2 Results on Real Data

We then test two stochastic subgradient MCMC methods, SSGLD and SSGNHT, on the Realsim dataset and the larger UCI Higgs dataset. The Higgs dataset contains $1.1 \times 10^7$ samples in a 28-dimensional feature space. We randomly choose part of the samples as the training set and the rest as the testing set.

For the Realsim dataset, we set the stochastic batchsize $|\tilde{\mathcal{D}}| = 10$ for all stochastic inference methods. For the Higgs dataset, we set $|\tilde{\mathcal{D}}|$ to be 1,000 for both SSGLD and SRWM and $|\tilde{\mathcal{D}}| = 100$ for SSGNHT. We fine-tuned polynomially decaying stepsizes for the stochastic subgradient MCMC methods, and specifically for SSGLD, we prefer the adaptive stepsize AdaGrad, which has been successfully applied to stochastic (sub)gradient descent [?]. For SRWM, the variance parameter is set as 0.01. These turn out to be a good setting according to the following sensitivity analysis in Section 5.1.3.

The convergence curves of various methods with respect to the running time on both datasets are shown
in Fig. 5. We can see that our stochastic subgradient MCMC methods are several orders of magnitude faster than
the baseline methods. Compared with the Gibbs sampling with data augmentation method, stochastic
subgradient MCMC methods get much cheaper updates and hence are more scalable. In particular, for the
larger Higgs dataset, a single update of Gibbs sampling is not finished by the time the stochastic subgradient
MCMC methods have converged. Furthermore, although both SRWM and stochastic subgradient MCMC use
stochastic minibatches, stochastic subgradient MCMC methods mix much faster than SRWM because the
posterior subgradient information provides the right direction to the true posterior.

Figure 5: Experimental results of Bayesian linear SVMs

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html


5.1.3 Sensitivity Analysis


Tuning the batchsize reflects an accuracy-efficiency tradeoff, analogous to the bias-variance tradeoff in
stochastic Monte Carlo sampling. In general, using a smaller batchsize often leads to a larger injected
noise, but the computation cost at each iteration, which is linear in the batchsize, is reduced.
When doing cross validation to select parameters, both accuracy and time efficiency are key factors that
should be taken into consideration.
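The tradeoff can be illustrated numerically. The sketch below, which uses a hypothetical one-dimensional hinge-loss problem and made-up data rather than the experimental setup, estimates the variance of the rescaled minibatch subgradient for several batchsizes; the per-iteration cost is proportional to the number of sampled points.

```python
import random

def minibatch_subgradient(data, eta, batch, rng):
    """Unbiased minibatch estimate of the summed hinge-loss subgradient (1-D)."""
    n = len(data)
    total = 0.0
    for x, y in rng.sample(data, batch):
        if y * eta * x < 1.0:  # margin violated: subgradient is -y * x
            total += -y * x
    return total * n / batch

def estimator_variance(data, eta, batch, repeats=300, seed=0):
    """Sample variance of the minibatch estimator over repeated draws."""
    rng = random.Random(seed)
    draws = [minibatch_subgradient(data, eta, batch, rng) for _ in range(repeats)]
    mean = sum(draws) / repeats
    return sum((d - mean) ** 2 for d in draws) / (repeats - 1)

# Hypothetical data: uniform features with random labels.
rng = random.Random(1)
data = [(rng.uniform(-1.0, 1.0), rng.choice([-1, 1])) for _ in range(5000)]
for batch in (10, 100, 1000):
    print(batch, estimator_variance(data, eta=0.5, batch=batch))
```

Under these assumptions the printed variance shrinks as the batchsize grows, while each estimate touches proportionally more data points.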

Fig. 6 presents the sensitivity analysis of the batchsize for the two stochastic subgradient MCMC methods
on both the Higgs and Realsim datasets. The performance of our stochastic subgradient MCMC appears to be
fairly promising except for extremely small batchsizes.

In our experiments, adaptive stepsizes (AdaGrad) bring a better mixing rate than the polynomial
decaying stepsizes. This may result from the flexible stepsize decay along different dimensions.
We also give an empirical analysis in Fig. 7. As can be seen, for the Higgs dataset, adaptive stepsizes
bring better classification results than the pre-defined polynomial-decaying stepsizes.
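The difference between the two stepsize schemes can be sketched as follows. The constants `a`, `b`, `gamma` and `delta`, and the names `polynomial_stepsize` and `AdaGradStepsize`, are hypothetical and are not the values tuned in the experiments; the point is only that the polynomial schedule is shared by all dimensions, whereas the AdaGrad-style stepsize decays separately in each dimension.

```python
import math

def polynomial_stepsize(t, a=0.01, b=1.0, gamma=0.55):
    """Pre-defined schedule a * (b + t) ** (-gamma), shared by all dimensions."""
    return a * (b + t) ** (-gamma)

class AdaGradStepsize:
    """Per-dimension stepsize a / sqrt(delta + sum of squared past subgradients)."""

    def __init__(self, dim, a=0.01, delta=1e-8):
        self.a = a
        self.delta = delta
        self.sq_sum = [0.0] * dim

    def update(self, grad):
        """Accumulate squared subgradients and return the current stepsizes."""
        self.sq_sum = [s + g * g for s, g in zip(self.sq_sum, grad)]
        return [self.a / math.sqrt(self.delta + s) for s in self.sq_sum]
```

A dimension that repeatedly sees large subgradients receives a smaller stepsize than one with small subgradients, which is one plausible reading of the flexible decay mentioned above.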



5.2 Gibbs Max-Margin Topic Models


Figure 7: Performance of SSGLD with AdaGrad (legend: adaptive stepsize, pre-defined stepsize; x-axis: iterations)


Now, we implement the fast sampling for Gibbs MedLDA. We show the efficiency and accuracy of our
stochastic subgradient Riemannian Langevin Dynamics (SSGRLD) using the 20news dataset and the larger
Wikipedia dataset. Following the dataset setting in [42], the stop words are removed according to a
standard list. We compare our SSGRLD with the data augmentation method (Gibbs MedLDA) and its newly
developed extension in the online Bayesian passive-aggressive learning framework (paMedLDA-gibbs) [31].
For the smaller 20news dataset, all three methods use the binary version and then adopt the
"one-vs-all" strategy for multi-class classification. For the larger Wikipedia dataset, the SSGRLD method
uses the multi-class setting and the other two use the multi-task formulation as described in prior work.
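The "one-vs-all" strategy mentioned above can be sketched as follows. This is a generic illustration, not the code used in the experiments: the function name `one_vs_all_predict` and the representation of each binary classifier as a plain weight vector are assumptions.

```python
def one_vs_all_predict(x, classifiers):
    """Pick the label whose binary (label vs. rest) linear classifier scores highest.

    `classifiers` maps each label to the weight vector of its binary classifier.
    """
    def score(w):
        return sum(wi * xi for wi, xi in zip(w, x))

    return max(classifiers, key=lambda label: score(classifiers[label]))
```

For example, with `classifiers = {"a": [1.0, 0.0], "b": [0.0, 1.0]}`, the input `[0.2, 0.9]` is assigned to label `"b"` because that classifier yields the larger score.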














Figure 6: Sensitivity of the batchsize parameter for both SSGLD and SSGNHT on the Higgs dataset
(first row) and the Realsim dataset (second row). Panels: SSGLD, SSGNHT.

5.2.1 Classification Performance

We first test on the 20news dataset, which consists of about 11K training documents and 20 categories. We set
the hyper-parameters as suggested in [42]. Fig. 8 (left) shows the number
of documents processed in order to reach a specific accuracy score, where the topic number is set as 30. As we
can see, the two stochastic samplers use far fewer documents and efficiently exploit the data redundancy
by using a minibatch at each iteration.

Then we test on the larger Wikipedia dataset, which consists of 1.1 million training documents and 20
categories. We use the same hyper-parameter setting as for the 20news dataset, except for a few settings
that differ between SSGRLD and the two Gibbs-based samplers (Gibbs MedLDA and paMedLDA-gibbs).
Fig. 8 (right) shows the classification performance as a function of time. It can be seen that SSGRLD produces
comparable classification results. As for the efficiency, both SSGRLD and paMedLDA-gibbs are one order
of magnitude more efficient than the previous Gibbs MedLDA. This is due to the minibatch training.
Meanwhile, although in the same order of magnitude, SSGRLD is still faster than paMedLDA-gibbs. We argue
that this is because SSGRLD does not use augmented variables and directly draws samples from the SVM classifier.
Moreover, the matrix inversion involved in the data augmentation technique is costly in the whole procedure.

Figure 8: Empirical results of the methods for MedLDA

5.2.2 Topic Representations

We visualize the discovered topic representations of SSGRLD on the 20news dataset. For all the
20 categories, we show the average topic representations of the documents from each category. As we can
see in Fig. 9, the average topic distribution for the corresponding classifier is very sparse (only one or two
non-zero entries). We also give the most representative top words of the salient topic(s) of each category in
Table 1. We can see that the top words of the salient topic(s) are highly related to the category information.
For example, the salient topic learned by the classifier for sci.space has top words such as NASA, launch, moon,
satellite, etc. These patterns are similar to those in [31], [42].

Table 1: Representative top words of the salient topic(s)

Category      Top words                       Category      Top words
atheism       god, atheism                    graphics      image, jpeg, file
windows       windows, file, card             pc            scsi, drive, disk, mb, dos
mac           mac, apple, drive               windows.x     window, server, file
forsale       anonymity, sphinx               rec.autos     car, engine, speed
motorcycle    bike, ride, bmw                 baseball      team, game, runs
hockey        team, nhl, season               crypt         key, chip, security, law
electronics   power, circuit, wire            medical       food, medical, doctor
space         nasa, launch, earth             christian     god, jesus, church, table
guns          gun, weapon, firearm            mideast       israel, turkish, jews, arab
politics      mr, president, states           religion      jesus, bible, christian


5.3 Infinite SVMs

The proposed subgradient-based sampling methods can also be used for fast inference of the infinite SVM (iSVM),
a Dirichlet process mixture of large-margin kernel machines.

We choose two datasets, Protein and IJCNN1, to test our methods. The Protein dataset was created
for protein fold classification and consists of samples from 27 classes with 21 features. The IJCNN1
dataset originates from a binary classification problem on an engine system and consists of 49,990 training
samples with 22 features.

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Figure 9: Visualization of learnt topics by SSGRLD

We implement two inference methods for iSVM: SSGNHT within Gibbs, and
Gibbs with data augmentation. Other models are also implemented for comparison, such
as the multinomial logit model (MNL), linear SVM, RBF-SVM and the DP mixture of generalized linear models
(dpMNL). We use cross-validation to choose hyper-parameters and report the results in Table 2.

We can see that nonlinear models using a mixture-of-experts, such as iSVM and dpMNL, are superior
in classification. With the stochastic subgradient MCMC, the $\eta$ sampling step can be dramatically accelerated, with
comparable or even better prediction performance. This superiority results from both stochastic subsampling
and avoiding the matrix inversion in the data augmentation technique.

Table 2

                Protein                                   IJCNN1
Model           Acc (%)     Time for η    Total Time     Acc (%)     Time for η    Total Time
MNL             ?           -             ?              ?           -             ?
Linear SVM      50.8        -             0.03           91.0        -             0.?6
RBF-SVM         51.1        -             0.11           93.?        -             2.79
dpMNL           56.3        -             7.64           94.0        -             7.62
Gibbs-iSVM      5?.?±0.?    ?.31±0.?7     15.15±0.29     9?.?±0.7    9.??±0.9?     21.71±1.16
SSGNHT-iSVM     56.1±0.?    0.17±0.02     7.32±0.??      94.?±0.5    1.17±0.08     13.64±1.90

6 Conclusions

We systematically investigate fast sampling methods for Bayesian max-margin models. We first study a
general subgradient HMC sampling method and several stochastic variants including SSGLD and SSGNHT.
Theoretical analysis shows the approximate detailed balance of the proposed stochastic subgradient MCMC
methods. Then we apply the stochastic subgradient samplers to Bayesian linear SVMs and two sophisticated
Bayesian max-margin models with latent variables (iSVM and Gibbs MedLDA). Extensive empirical studies
demonstrate the effectiveness of the stochastic subgradient MCMC methods in improving time efficiency
while maintaining a high accuracy of the samples.

The strengths of our methods are 1) fast inference for BMM models compared with the previous Gibbs
sampling method with data augmentation, 2) accurate sampling which is as good as the Gibbs sampling
with data augmentation, and 3) applications to non-conjugate posterior sampling which cannot be simply
accomplished. However, when the data sizes of the applications are too large to be processed on a single
machine, it is still difficult to use only stochastic subgradient MCMC to solve the problem.

We consider the future work in three categories: algorithm-level, model-level and application-level. For
the proposed algorithm itself, the future work includes further scaling up using parallel computation.
For the model setting, the future work includes applying our method to other models with continuous but
non-smooth posteriors, such as sparse models with Laplacian priors. At the application level, we consider
using our method to scale up several Bayesian max-margin models that are used in intelligent systems, such
as nonparametric max-margin matrix factorization for collaborative filtering [36].

Big data is identified as an important building block of intelligent systems, and fast
inference is becoming a central element therein [22]. For related Bayesian models [24], big learning with
Bayesian models is one of the recent research focuses. Particularly, the Bayesian max-margin models
are well studied for various machine learning applications, but they still lack fast inference methods. Our
method accomplishes fast sampling for the BMM models, which will be used in future large-scale intelligent
systems.